Abstract With the development of Natural Language
Processing (NLP), more and more systems aim to adopt
NLP in their user interface modules to process user
input and communicate with users in a natural way.
However, this raises a speed problem: if the NLP module
cannot process sentences within a tolerable delay, users
will abandon the system. As a result, systems with
strict processing-time requirements, such as dialogue
systems, web search systems, automatic customer service
systems, and especially real-time systems, have to give
up the NLP module in order to respond faster. This paper
aims to solve the speed problem. First, the construction
of a syntactic parser based on corpus machine learning
and a statistical model is introduced, and a speed
analysis is performed on the parser and its algorithms.
Based on the analysis, two accelerating methods,
Compressed POS Set and Syntactic Patterns Pruning, are
proposed to improve the time efficiency of parsing in
the NLP module. To evaluate different parameters of the
accelerating algorithms, two new factors, PT and RT, are
introduced and explained in detail. Experiments are also
carried out to test these methods, which are expected to
contribute to the application of NLP.

Keywords: Parsing Algorithm, Evaluation, Corpus
Learning, Question Answering, Natural Language
Processing

1 Introduction 

Natural Language Processing (NLP) is one of the
most important fields of Artificial Intelligence
research, and it is increasingly applied in
application systems. For example, NLP can be used in
Question Answering (QA) systems to understand
users' natural language input and to communicate
with users in a natural way, as in LUNAR [1]
and some service systems [2]. These applications
have greatly improved the way users interact with
computer systems and overcome the disadvantages
of traditional QA systems that use pattern matching
algorithms, for example ALICE [3].

However, as NLP technology has developed, a major
problem has emerged. Most researchers concentrate on
improving the precision of Part-of-Speech (POS)
taggers and syntactic parsers, but there is little
research on how to save CPU time in tagging and
parsing without losing precision. Meanwhile, NLP is
increasingly applied in real-time QA systems, such
as dialogue systems, web search, cell phones and
PDAs [4]. As a result, processing time is becoming
more and more important for NLP applications,
because users need responses to their requests
within an acceptable length of time. In fact, the
speed problem is the main reason why most QA
systems choose pattern matching algorithms rather
than NLP methods.

How, then, can the speed of a syntactic parser be
improved? An NLP system usually includes several
parts, such as a stemmer module, a word tagging
module, and a syntactic parsing module. Many
algorithms have been proposed for these modules.
Syntactic parsing takes most of the processing
time, so speeding up syntactic parsing is one of
the most important measures, although the
optimization of the other modules is also necessary.

Syntactic patterns are needed in the syntactic
parsing module, but it is impractical for humans to
construct them by hand. Firstly, it is hard to
define a large number of syntactic patterns.
Secondly, it is impossible to decide the probability
of each pattern's appearance manually. Corpus
machine learning is therefore the best way to
generate the syntactic pattern dictionary. In this
paper, we use the Penn Corpus [5] developed by the
University of Pennsylvania. The Penn Treebank
project [6] produces skeletal parses, based on an
initial POS tagging, that show rough syntactic and
semantic information on about 2.5 million English
words.

In the rest of this paper, we first introduce the
problems that arise in constructing a syntactic
parser. Then an improved corpus learning algorithm
is proposed to improve the time efficiency of
parsing. To evaluate the time efficiency of
parsing, two new evaluation factors are introduced.
Experiments are also performed to test these
algorithms. At the end of the paper, conclusions
and future work are summarized.



2 Syntactic Parser Construction

In this section, we will introduce how to construct 
a parser which can learn from corpus, discuss how 
to parse sentences, and analyze the reason why the 
speed problem exists. 

2.1 Corpus Machine Learning 

Syntactic pattern learning generates a syntactic
pattern dictionary through machine learning from a
tagged and parsed corpus. In the learning process,
every pattern that appears in the corpus is
extracted, and its appearance count and probability
are recorded for further processing.

In this paper, $N$ stands for nonterminals, like
"S", "NP", "VP", and $N^j$ means the $j$-th
nonterminal in the nonterminal set.

For each syntactic pattern $N^j \to \zeta$, where
$N^j$ and $\zeta$ are both Part-of-Speech (POS),
$C(N^j \to \zeta)$ is used to record the appearance
count of the pattern, and $P(N^j \to \zeta)$
represents its appearance probability. The following
Formula 1 represents the relationship between the
two variables:

$$P(N^j \to \zeta) = \frac{C(N^j \to \zeta)}{\sum_{\gamma} C(N^j \to \gamma)} \qquad (1)$$

In this equation, $\gamma \in R$ and $R$ is the full
POS Pattern Set. So $\sum_{\gamma} C(N^j \to \gamma)$
stands for the total count of all the possible
patterns which have the same left side $N^j$.
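Formula 1 can be illustrated with a minimal Python sketch. The function
name `pattern_probabilities` and the toy counts are assumptions made
for illustration only; they are not taken from the parser described in
this paper or from the Penn Treebank.

```python
from collections import Counter, defaultdict


def pattern_probabilities(pattern_counts):
    """Return P(lhs -> rhs) = C(lhs -> rhs) / sum of C(lhs -> rhs') over all rhs'."""
    totals = defaultdict(int)
    for (lhs, _rhs), count in pattern_counts.items():
        totals[lhs] += count  # denominator of Formula 1 for this left side
    return {
        (lhs, rhs): count / totals[lhs]
        for (lhs, rhs), count in pattern_counts.items()
    }


# Toy counts, invented for illustration.
counts = Counter({
    ("NP", ("DT", "NN")): 6,
    ("NP", ("NN",)): 3,
    ("NP", ("DT", "JJ", "NN")): 1,
    ("VP", ("VBD", "NP")): 4,
})
probs = pattern_probabilities(counts)
print(probs[("NP", ("DT", "NN"))])  # 6 / (6 + 3 + 1)
```

Probabilities are normalized per left side, so the values of all
patterns sharing a left side sum to 1.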



2.2 Parsing and Speed Problem 

Syntactic parsing is defined as generating a
syntactic tree from a given sentence. For example,
the chart parsing algorithm can be used to parse
sentences; for more details, see [7], [8] and [9].

In the "predictor" function of chart parsing, all
patterns with the required left side, like
"$N^j \to \gamma$", are added to the chart to predict
the next matching pattern. However, if the system
has a large syntactic pattern dictionary, a lot of
patterns will be added, including both important
patterns and unimportant patterns which have a low
probability of appearing. In most cases, the
unimportant patterns contribute nothing to the
parse tree. In other words, the unimportant
patterns are most likely not part of the syntactic
tree, yet they consume most of the processing time
in parsing. As a result, the system spends much of
its time processing meaningless patterns and runs
very slowly.
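The cost of the predictor step can be seen in a small sketch. This is
a simplified Earley-style predictor, not the implementation used in
the experiments; the dictionary layout and the name `predict` are
hypothetical.

```python
def predict(lhs, dictionary):
    """Return one new chart item per pattern whose left side is `lhs`.

    Each item is (left side, right side, dot position); the dot starts at 0.
    """
    return [(lhs, rhs, 0) for rhs in dictionary.get(lhs, [])]


# Hypothetical dictionary: left side -> list of right sides.
dictionary = {
    "NP": [("DT", "NN"), ("NN",), ("DT", "JJ", "NN"), ("NP", "PP"), ("NN", "NN")],
    "VP": [("VBD", "NP")],
}
items = predict("NP", dictionary)
print(len(items))  # one item per NP pattern, however rare the pattern is
```

The number of items added grows with the number of patterns stored for
the left side, which is why a smaller dictionary means less work at
every prediction.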

An efficient way to accelerate the parser is to
compress the pattern set (combine similar patterns)
and delete some unimportant patterns from the
dictionary. Which patterns should be deleted and
which kept is the central question, and it is the
main issue of the next section.



3 Accelerating Methods 

In parsing experiments on the Penn Treebank, for
example using chart parsing, it is found that
parsers cannot parse sentences within a tolerable
time, because too many POS, nonterminal and
syntactic pattern types have been generated. This
is unacceptable for real-time systems.

There are mainly two solutions to this problem:
using a Compressed POS Set and Patterns Pruning.
In this section, both algorithms are discussed, and
experiments are performed to test them.

3.1 Compressed POS Set 

A Compressed POS Set is a set of POS in which some
POS of the full POS set have been combined in order
to decrease the number of different POS types. For
example, we can combine "NNS" and "NNP", so the
patterns "NP -> NNS" and "NP -> NNP" are combined
into "NP -> NN". Syntactic patterns use both
terminal symbols (e.g. NNS, VBD) and nonterminal
symbols (e.g. NP, VP), so both have to be
compressed to decrease the number of pattern types
and thereby accelerate parsing.
Table 1: Compressed POS Set

Items                     Full       Compressed
Terminal types            46         27
Dimension of HMM Array    46 x 46    27 x 27
Tag-tag pair types        1003       341
Word-tag pairs            12726      11787
Pattern types             8001       4947
Nonterminal types         659        232
Time elapsed (ms)         60515      50234

Table 2: Comparison of Different POS Sets



First, we combine terminals. For the Penn Treebank
style POS set, the compressed POS set in Table 1 is
applied to decrease the number of POS. The POS in
column "Original" are combined into the POS in
column "Compressed". With this mapping, the number
of terminals decreases from 46 to 27.

Second, we combine nonterminals. For example,
"WHNP-22 -> WDT", "WHNP-23 -> WDT", and
"WHNP-24 -> WDT" are essentially the same, so they
can be combined into one pattern "WHNP -> WDT". The
same applies to patterns such as
"NP-SBJ-33 -> DT NN" and "NP-SBJ-35 -> DT NN".
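Both compression steps can be sketched as follows. Only the NNS/NNP to
NN mapping comes from the text; the rest of the terminal mapping is
not reproduced here, and the rule of stripping a trailing numeric
index from nonterminals is an assumption based on the WHNP examples.
All function names are hypothetical.

```python
import re

# Hypothetical excerpt of the terminal mapping; only these two entries
# are taken from the text.
TERMINAL_MAP = {"NNS": "NN", "NNP": "NN"}

# Assumption: a trailing "-<digits>" is an index that can be dropped.
_INDEX_SUFFIX = re.compile(r"-\d+$")


def compress_symbol(symbol):
    """Drop a trailing numeric index, then map the tag to its compressed form."""
    symbol = _INDEX_SUFFIX.sub("", symbol)
    return TERMINAL_MAP.get(symbol, symbol)


def compress_pattern(lhs, rhs):
    """Compress the left side and every symbol on the right side."""
    return compress_symbol(lhs), tuple(compress_symbol(s) for s in rhs)


patterns = [
    ("WHNP-22", ("WDT",)),
    ("WHNP-23", ("WDT",)),
    ("NP", ("NNS",)),
    ("NP", ("NNP",)),
]
compressed = {compress_pattern(lhs, rhs) for lhs, rhs in patterns}
print(sorted(compressed))  # the four patterns collapse into two
```

In a full learner the counts of merged patterns would be summed before
the probabilities are recomputed.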

Table 2 shows the differences between using the
full POS set and the compressed POS set. These data
are based on the 10% version of the Penn Treebank,
which contains a total of 10959 words and 94200
tag-tag pairs. In our experiments, the number of
tag-tag pair types decreases to 341 using the
Compressed POS Set, which is only 34% of that of
the full POS set. The number of pattern types
decreases to 4947, which is 61.8% of that of the
full POS set. The number of nonterminal types
decreases to 232, which is 35.2% of that of the
full POS set. The time elapsed in learning
decreases by about 17%. The memory and disk space
occupied by the learning result also decrease
greatly. These data indicate that a Compressed POS
Set can effectively improve the tagging and parsing
speed of a parser in an NLP module.
PA: appearance times of remaining patterns.
PT: number of remaining pattern types.
NT: number of nonterminal types.

Table 3: SPP on Appearance Times Threshold 

3.2 Syntactic Patterns Pruning 

Syntactic Patterns Pruning (SPP) deletes some
unimportant patterns from the pattern dictionary in
order to save parsing time.

Compared with the Compressed POS Set, SPP is much
more important for accelerating the parser. In the
parsing process, seldom-appearing patterns waste
much CPU time but contribute little to precision
and recall. So in cases where processing time is
critical and precision is less important, precision
can be allowed to decrease a little through SPP in
order to decrease the time elapsed in parsing. This
is a trade-off between precision and speed. In most
cases, users input short sentences rather than the
long or complex sentences found in the Penn
Treebank, so the parsing precision is not expected
to decrease greatly.

There are mainly three ways for SPP, which will 
be discussed as follows. 

3.2.1 SPP on times thresholds 

This method defines a threshold on a pattern's
appearance times and prunes patterns whose
appearance times are below the threshold. Table 3
shows the relationship between the threshold value
and the number of pattern types after pruning. The
relationship is also depicted in Figure 1 according
to the data in the table.

According to the data in Table 3, as the threshold
rises, the numbers of both pattern types and
nonterminal types decrease markedly, but the total
appearance times of the remaining patterns
decreases only a little. For example, at the point
N = 50, the number of pattern types decreases to
3.23% of all pattern types, and the number of
nonterminal types decreases to 15.9% of all types,
but the total appearance times of the remaining
patterns only decreases to 80% of that of all
patterns.

Figure 1: SPP on Appearance Times Threshold

The reason is that many patterns seldom appear,
maybe only once or twice. These patterns cannot
greatly improve the precision of the syntactic
parser, but waste a lot of processing time. As a
result, they should be removed from the pattern
dictionary in order to improve speed.

In Figure 1, over the interval N in [15, 50], the
curve becomes much flatter. Our experiment in a
later section shows that the precision of the
parser is still acceptable at the point N = 50.

After pruning, the remaining patterns are mainly of
the forms "$\mathrm{NP} \to \gamma$" and
"$\mathrm{VP} \to \gamma$", because NP and VP appear
in the corpus much more frequently than other
nonterminals do.
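Pruning on an appearance times threshold, together with the three
quantities reported in Table 3, can be sketched as below. The counts
are invented for illustration, the function names are hypothetical,
and counting nonterminal types by left side only is an assumption.

```python
def prune_by_count(pattern_counts, n):
    """Keep only the patterns that appear at least `n` times."""
    return {p: c for p, c in pattern_counts.items() if c >= n}


def summarize(pattern_counts):
    """Return (appearance times, pattern types, nonterminal types)."""
    appearances = sum(pattern_counts.values())
    pattern_types = len(pattern_counts)
    # Assumption: a nonterminal type is counted if it occurs as a left side.
    nonterminal_types = len({lhs for lhs, _rhs in pattern_counts})
    return appearances, pattern_types, nonterminal_types


# Toy counts, not taken from the Penn Treebank.
counts = {
    ("NP", ("DT", "NN")): 60,
    ("NP", ("NN",)): 30,
    ("NP", ("DT", "JJ", "NN")): 2,
    ("VP", ("VBD", "NP")): 40,
    ("VP", ("VBD",)): 1,
    ("WHNP", ("WDT",)): 1,
}
print(summarize(counts))                     # (134, 6, 3)
print(summarize(prune_by_count(counts, 5)))  # (130, 3, 2)
```

In this toy data, half of the pattern types disappear while the total
appearance times barely changes, which mirrors the trend described for
Table 3.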

3.2.2 SPP on probability threshold 

This method defines a threshold on pattern
appearance probability and prunes patterns whose
appearance probability is smaller than the
threshold. Table 4 shows the relationship between
the threshold and the number of pattern types. The
relationship is also shown in Figure 2 according to
the data in the table.

According to the data in Table 4, as the threshold
rises, the number of pattern types decreases
greatly, and the total appearance times of the
remaining patterns also decreases, but the number
of nonterminal types hardly decreases, especially
in the interval [0, 10%], where it does not
decrease at all. For example, at the point P = 10%,
the number of pattern types decreases to 7.43% of
all pattern types, and the total appearance times
decreases to 47.7% of that of all patterns, but the
number of nonterminal types does not decrease.

Table 4: SPP on Probability Threshold

Figure 2: SPP on Probability Threshold

The reason is that the sum of the probabilities of
all the patterns with the same left side equals 1:

$$\sum_{\zeta} P(N^j \to \zeta) = 1 \qquad (2)$$

As a result, when pruning on an appearance
probability threshold, if the threshold P is small,
as at the point P = 10%, no nonterminals are
pruned. But when the threshold is raised, patterns
that share the same left side and whose different
right sides appear with comparable frequency are
pruned first, because the probability of that left
side is split among many right sides.
Unfortunately, these are often the most common and
important patterns. By contrast, if an unimportant
pattern appears only once and is the only pattern
with its left side, its probability is 100% and it
will never be pruned. So the loss of nonterminals
should be avoided if this method is adopted. In
other words, a low threshold should be defined.

Table 5: Mixed SPP

In Figure 2, over the interval P in [5%, 10%], the
curve becomes much flatter. Our experiment in a
later section shows that the precision of the
syntactic parser is still acceptable at the point
P = 5%.
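The behaviour of probability-based pruning, including the case of a
rare pattern that survives because it is the only one with its left
side, can be sketched as follows. The counts and the function name
`prune_by_probability` are assumptions made for illustration.

```python
from collections import defaultdict


def prune_by_probability(pattern_counts, p):
    """Keep patterns whose probability among same-left-side patterns is >= p."""
    totals = defaultdict(int)
    for (lhs, _rhs), count in pattern_counts.items():
        totals[lhs] += count
    return {
        (lhs, rhs): count
        for (lhs, rhs), count in pattern_counts.items()
        if count / totals[lhs] >= p
    }


# Toy counts, invented for illustration.
counts = {
    ("NP", ("DT", "NN")): 60,
    ("NP", ("NN",)): 30,
    ("NP", ("DT", "JJ", "NN")): 2,   # 2 / 92, about 2.2%
    ("WHNP", ("WDT",)): 1,           # appears once, but probability is 1.0
}
kept = prune_by_probability(counts, 0.05)
print(sorted(kept))
```

With a 5% threshold the rare NP pattern is removed, but the WHNP
pattern that appears only once is kept, so the number of nonterminal
types does not decrease.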

3.2.3 Mixed SPP 

When SPP on the appearance probability threshold is
used, a great number of syntactic patterns have to
be retained in order not to lose any nonterminals,
which slows down the parser. The Mixed SPP method,
which prunes on both the appearance times threshold
and the probability threshold, can be adopted to
keep the advantages of both methods.

For example, if patterns which appear fewer than 30
times or have a probability of less than 5% are
pruned, 44 nonterminals and 44226 pattern
appearances are left, which belong to 79 pattern
types. In this case, a syntactic parser can hardly
parse very long sentences correctly, but it can
parse short sentences, which are the ones most
frequently entered by users, quickly and precisely.
For example, simple sentences such as "The journal
will report events of the past century" and "I want
to find a job" can be parsed correctly.

Table 5 shows the relationship between the values
of the two thresholds and the number of pattern
types.
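Mixed SPP can be sketched as a single filter over the pattern counts.
This is a minimal illustration, assuming that probabilities are
computed from the unpruned counts; the function name `mixed_spp` and
the toy data are hypothetical.

```python
from collections import defaultdict


def mixed_spp(pattern_counts, n, p):
    """Prune patterns that appear fewer than n times or have probability below p."""
    totals = defaultdict(int)
    for (lhs, _rhs), count in pattern_counts.items():
        totals[lhs] += count  # computed before any pruning (assumption)
    return {
        (lhs, rhs): count
        for (lhs, rhs), count in pattern_counts.items()
        if count >= n and count / totals[lhs] >= p
    }


# Toy counts, invented for illustration.
counts = {
    ("NP", ("DT", "NN")): 60,
    ("NP", ("NN",)): 30,
    ("NP", ("DT", "JJ", "NN")): 2,
    ("WHNP", ("WDT",)): 1,
}
kept = mixed_spp(counts, n=30, p=0.05)
print(sorted(kept))  # only the two frequent NP patterns remain
```

The count threshold removes the singleton WHNP pattern that the
probability threshold alone would keep, and the probability threshold
can stay low.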

4 Evaluation 

In the previous sections, it has been shown that
the Compressed POS Set and SPP can accelerate
parsers effectively. But because the thresholds N
and P can be defined differently, parsers with
different thresholds will perform differently. An
evaluation method is needed to compare different
<N, P> pairs for the Mixed SPP method.

In this section, two new evaluation factors are
defined to describe a parser's efficiency. The
higher their values are, the faster and more
accurate the parser is.



4.1 PT Factor 

As discussed in the previous sections, in order to
accelerate a parser, we should keep a balance
between speed and precision when using the
Compressed POS Set and the SPP algorithm.

To get a high score in efficiency evaluation, a
parser should process as many sentences correctly
as possible per unit of time. Parameter $\mu$ is
defined to represent this concept:

$$\mu = \frac{C^{+}}{T} \qquad (3)$$

In the formula, $C^{+}$ represents the number of
syntactic patterns correctly parsed by the parser,
and $T$ represents the time elapsed in parsing.

Assume that parameter $C$ represents the total
number of parsed syntactic patterns, including both
correctly and incorrectly parsed patterns. Because
tests on different parsers are based on the same
test set, the $C$ values are equal. As a result,
the ratio of $\mu_1$ to $\mu_2$ equals the
following:

$$\frac{\mu_1}{\mu_2} = \frac{C_1^{+} \, T_2}{T_1 \, C_2^{+}} = \frac{C_1^{+} \, T_2 \, C}{T_1 \, C \, C_2^{+}} \qquad (4)$$

Besides, the precision of a system, denoted $P$,
is computed as follows:

$$P = \frac{C^{+}}{C} \qquad (5)$$

So, Formula 4 is transformed into the following
formula:

$$\frac{\mu_1}{\mu_2} = \frac{P_1 / T_1}{P_2 / T_2} \qquad (6)$$

Here, the magnitude of $\mu$ is decided by the
ratio of precision to time. So, the factor PT is
proposed to evaluate the time efficiency of a
parser, defined as follows:

$$PT = \frac{P}{T} \qquad (7)$$

where $P$ represents the precision of parsing, and
$T$ represents the time elapsed in parsing. An
efficient parser should have a higher PT value.



4.2 RT Factor 

The PT factor can evaluate parsers effectively.
However, the following situation exists: as a
result of too much pruning, the elapsed time
decreases greatly, and precision also decreases
greatly. In this situation, the PT value is nearly
the same as that of a parser with high precision.
To solve this problem, the balance between recall
and speed should also be evaluated. That is, a
parser should correctly recall as many syntactic
patterns as possible per unit of time. Variable
$\lambda$ is defined to represent this concept:

$$\lambda = \frac{C^{+}}{T} \qquad (8)$$

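where $C^{+}$ represents the number of syntactic
patterns correctly recalled by the parser, and $T$
represents the time elapsed in parsing. Again,
tests on different parsers are based on the same
test set, so the $C$ values (here presumably the
total number of syntactic patterns in the test set)
are equal. As a result, the ratio of $\lambda_1$ to
$\lambda_2$ equals the following:

$$\frac{\lambda_1}{\lambda_2} = \frac{C_1^{+} \, T_2}{T_1 \, C_2^{+}} = \frac{C_1^{+} \, T_2 \, C}{T_1 \, C \, C_2^{+}} \qquad (9)$$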

The recall rate of the system, denoted $R$, is
computed as follows:

$$R = \frac{C^{+}}{C} \qquad (10)$$

So, Formula 9 is transformed into the following
formula:

$$\frac{\lambda_1}{\lambda_2} = \frac{R_1 / T_1}{R_2 / T_2} \qquad (11)$$

Here, the magnitude of $\lambda$ is decided by the
ratio of recall rate to time elapsed. So, the
factor RT is also proposed to evaluate the time
efficiency of a parser, defined as follows:

$$RT = \frac{R}{T} \qquad (12)$$

In this formula, $R$ represents the recall rate of
the parser, and $T$ represents the time elapsed in
parsing. RT reflects the number of correctly
recalled patterns per unit of time. An efficient
parser should have a higher RT value as well as a
higher PT value.

The following example shows how to compute PT and
RT values. A test, in which the compressed POS set
is used and the patterns that appear fewer than 50
times have been pruned, was performed on the first
5854 lines of the Penn Treebank. The precision of
syntactic parsing is 247/691 = 35.7%, the recall
rate is 247/761 = 32.5%, and 4048 seconds elapsed
in the test. So the PT and RT values can be
calculated as follows, with precision and recall in
percent and time in seconds:

$$PT = 35.7 / 4048 = 0.0088 \qquad (13)$$

$$RT = 32.5 / 4048 = 0.0080 \qquad (14)$$
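The calculation above can be reproduced with a short Python sketch.
The function name `pt_rt` and its parameter names are hypothetical;
precision and recall are expressed in percent and time in seconds, as
in the worked example.

```python
def pt_rt(correct, proposed, gold, seconds):
    """Return (PT, RT) with precision and recall expressed in percent."""
    precision = 100.0 * correct / proposed  # correct / patterns output by the parser
    recall = 100.0 * correct / gold         # correct / patterns in the test set
    return precision / seconds, recall / seconds


pt, rt = pt_rt(correct=247, proposed=691, gold=761, seconds=4048)
print(f"PT = {pt:.4f}, RT = {rt:.4f}")  # PT = 0.0088, RT = 0.0080
```

Because both factors divide by the same elapsed time, two parsers can
be compared directly as long as they are timed on the same test set
and the same machine.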

Our experiment in the next section will discuss 
how to choose parameters in order to increase PT 
and RT values. 



5 Experiments 

Our experiments are performed on the 10% version of
the Penn Treebank, obtained from NLTK [10]. The
Compressed POS Set is adopted in all the following
experiments.

Firstly, the precision of our tagger is tested on
the first 9988 lines of the Penn Treebank, and the
result is 9479/9784 = 96.88%.

Then, parsers with different thresholds are
tested on the first 5854 lines:

(1) Threshold N=50, Precision = 247/691 =
35.7%, Recall = 247/761 = 32.5%.

(2) Threshold N=60, Precision = 235/662 =
35.5%, Recall = 235/761 = 30.9%.

(3) Threshold P=60%, Precision = 0, Recall = 0.

(4) Threshold P=5%, Precision = 30/94 =
31.9%, Recall = 30/761 = 3.9%.

(5) Threshold N=10, P=2%, Precision = 72/208
= 34.6%, Recall = 72/761 = 9.5%.

Although test (5) gets a lower recall than test
(1), its parsing speed is far faster than that of
test (1).

All the related data of the tests are summarized
in Table 6 and Table 7. Of the five tests, test (5),
in which the number of pattern types is about the
same as or even smaller than in the other tests,
has nearly the highest PT and RT values. So the
thresholds in test (5) are the best parameters in
these tests for a system with strict
processing-time requirements.

But when a system is not strict about processing
time, the pruning setting with lower PT and RT
values but higher precision and recall should be
adopted. For example, if the small differences in
precision and recall between test (1) and test (2)
are ignored, test (2) is better than test (1).

In real systems, users usually input short
sentences, so precision and speed are expected to
be higher than in these experiments.

Table 6: PT Value after Pruning

Table 7: RT Value after Pruning



6 Future Work 

In the future, more accelerating algorithms should
be proposed to improve the parser, which would
promote the application of NLP technology,
especially in real-time systems. These methods may
include:

(1) Determine the importance level of syntactic
patterns by their content and constitution. For
example, some patterns appear frequently in written
English corpora but seldom in spoken English or
user input, so these patterns could be removed from
the dictionary without influencing precision and
recall.

(2) Instead of chart parsing, a new parsing
algorithm may be proposed to meet the new speed
demands. In the new algorithm, the disadvantages
brought by the "predict" step should be avoided.

As to evaluation, different application backgrounds
call for different evaluation factors defined for
the specific circumstances. In other words,
precision and recall are not the only things we
should pay attention to.